Cross-language analysis of world regions in the press

An empirical approach based on Wikidata

Claude Grasland & Etienne Toureille

19/10/2021

1. INTRODUCTION

Previous analyses of German (left) and French (right) newspapers have demonstrated the value of analysing networks of states and world regions :

But before validating these results, we need :

  1. to clarify our definition of world regions and the associated list of target units.
  2. to extend the dictionary to other languages (Turkish, Arabic, English)

Objectives

The objective of this short note is to explore the potential of Wikidata for producing multilingual dictionaries of world regions and, more generally, of regional imaginations. To test the value of this approach, we will try to produce multilingual dictionaries for identifying different types of “regions” related to the division of the Earth (“natural”) or the division of the World (“political”).

Earth/natural regions : Atlases

So-called “physical maps” in atlases are a good source :

Earth/natural regions : Textbooks

Textbooks and educational games for children are also crucial :

World/Political regions : IGO

An attempt to classify intergovernmental organizations (IGO) into 4 types :

  1. Continental organizations
  2. Subcontinental organizations
  3. Transcontinental alliances
  4. Heritage of empires

Source : https://commons.wikimedia.org/wiki/Atlas_of_international_organizations

World/Political regions : Other …

A cross-language perspective

We propose to establish a dictionary of Earth and World regions in the five languages of interest for the IMAGEUN project :

We want to avoid any “eurocentric” or “anglocentric” perspective in the definition of entities. Our definition of entities will therefore follow three rules :

  1. Non universal : Entities will not necessarily be available in all languages
  2. Non equivalent : Translation of names does not imply equivalence of entities
  3. Non hierarchic : An entity has a different definition in each language. No language can be considered as “pivot” or “reference.”
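
As an illustration of these three rules, here is a minimal sketch in base R of how such a dictionary could be stored. The long layout (one row per entity, language and label) and the helper `labels_for()` are assumptions made for this example, not the structure actually used in the project. The labels are copied from the top 20 table later in this note, where Q2841453 has no Turkish label.

```r
# Long-format dictionary: one row per (entity, language, label).
# No language is used as a pivot: every label is stored with its own language.
dict <- data.frame(
  id    = c("Q15", "Q15", "Q15", "Q15",
            "Q2841453", "Q2841453", "Q2841453"),
  lang  = c("de", "en", "fr", "tr", "de", "en", "fr"),
  label = c("Afrika", "Africa", "Afrique", "Afrika",
            "Amazonien", "Amazonia", "Amazonie"),
  stringsAsFactors = FALSE
)

# Rule 1 (non universal): languages in which each entity is available.
available <- tapply(dict$lang, dict$id,
                    function(x) paste(sort(x), collapse = " "))

# Rule 3 (non hierarchic): a lookup always names its language explicitly
# and returns character(0) when the entity has no label in that language.
labels_for <- function(id, lang) {
  dict$label[dict$id == id & dict$lang == lang]
}

print(available)
labels_for("Q15", "fr")
labels_for("Q2841453", "tr")  # empty: no Turkish label in this sample
```

Rule 2 (non equivalent) is not enforced by the code: sharing an `id` only records that Wikidata links the labels, not that the entities mean the same thing in each language.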

Entities equivalences and lexical universes

To summarize, we propose to build partial equivalences between entities that belong to different lexical universes.

The comparison between lexical universes will necessarily be limited to a small sample of entities that we can assume to be approximately equivalent.

2. WIKIDATA

Wikidata defines itself as

Codification of entities

The first benefit of Wikidata is that it provides unique identification codes for objects. For example, a search for “Africa” returns a list of different objects, each characterized by a unique code :

Information on entities

Once we have selected an entity (e.g. Q15), we obtain a new page with more detailed information in English but also in all other languages available in Wikipedia.

Information on entities

A lot of information is available about the entity but, at this stage, the most important items for our research are :

  1. the translation in different languages
  2. the equivalent words or expressions in different languages
  3. the definitions in different languages
  4. the ambiguity of the term in each language and the potential risks of confusion with other entities.

Of course, we should not take the answers proposed by Wikidata for granted (as Georg noted, Wikipedia is itself a matter of research for IMAGEUN …), but it undoubtedly offers a very good opportunity to clarify our questions and helps us build tools for recognizing world regions and other geographical imaginations in a multilingual perspective.

Multilanguage definitions

A Wikidata entity like Q15 is an element of an ontology designed by its authors for specific purposes. The specificity of the Wikidata ontology is that it is a multilingual web in which Q15 is a node present in different linguistic layers. This means that we do not have a single name or a single definition of Q15, unless we adopt the neocolonial perspective of choosing English as the reference language. Depending on the context (i.e. the language or sub-language), Q15 could be defined as :

language | definition
fr | A continent named “Afrique”
en | A “continent on the Earth’s northern and southern hemispheres” named “Africa” or “African continent”
de | A “Kontinent auf der Nord- und Südhalbkugel der Erde” named “Afrika”
tr | A “Dünya'nın kuzey ve güney yarıkürelerindeki bir kıta” named “Afrika” or “Afrika kıtası”
ar | The second largest continent in the world in terms of area and population, second only to Asia (translated)

Correspondence between entities ?

Sharing the same Wikidata entity code does not guarantee any concordance between the geographical objects found in news published in different languages or different countries. But, and this is the important point, it helps us identify similarities and differences between sets of geographical entities that are more or less comparable in each language.

Cross-language perspective

Bearing in mind the limits of the equivalence of entities across languages, it can nevertheless be an interesting experiment to select a set of Wikidata entities (Q15, Q258, Q4412 …) and to examine their relative frequency in our media from different countries and in different languages. A typical hypothesis could be something like :

which is not equivalent to the question

but rather equivalent to the two joint questions

Boring details

The package WikidataR is an interface to the Wikidata API for the R language. Equivalent tools are available in Python and other languages for those not familiar with R, and it is of course possible to use the API directly. The first step is to install the most recent version of the R package WikidataR, which also installs related packages of interest.

# Install once if needed (left commented out), then load the package
# install.packages("WikidataR")
library(WikidataR)

Boring details

If we start our search with the word “Afrique” in French, we find more than 50 entities whose label contains this word. Only the first 10 are presented below :

item_id | item_label | item_desc | item_lang | item_text
Q15 | Africa | continent on the Earth’s northern and southern hemispheres | fr | Afrique
Q181238 | Africa | Roman province on the northern African coast covering parts of present-day Tunisia, Algeria, and Libya | fr | Afrique
Q203548 | African Plate | continental plate underlying Africa | fr | Afrique
Q258 | South Africa | sovereign state in Southern Africa | fr | Afrique du Sud
Q4412 | West Africa | region of Africa | fr | Afrique de l’Ouest
Q132959 | Sub-Saharan Africa | area of the continent of Africa that lies south of the Sahara Desert | fr | Afrique subsaharienne
Q27394 | Southern Africa | southernmost region of the African continent | fr | Afrique australe
Q27407 | East Africa | easterly region of the African continent | fr | Afrique de l’Est
Q27381 | North Africa | northernmost region of the African continent | fr | Afrique du Nord
Q2826196 | Afrique | Wikimedia disambiguation page | fr | Afrique
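
A first check that can be run on such a result list is to look for labels shared by several entities, since these are the ones that cannot be separated by text matching alone. The sketch below uses base R only and retypes the two relevant columns of the table above by hand; the object names `res`, `ids_by_label` and `ambiguous` are ours.

```r
# Search results for "Afrique" in French (ids and French labels from the table).
res <- data.frame(
  item_id   = c("Q15", "Q181238", "Q203548", "Q258", "Q4412",
                "Q132959", "Q27394", "Q27407", "Q27381", "Q2826196"),
  item_text = c("Afrique", "Afrique", "Afrique", "Afrique du Sud",
                "Afrique de l’Ouest", "Afrique subsaharienne",
                "Afrique australe", "Afrique de l’Est",
                "Afrique du Nord", "Afrique"),
  stringsAsFactors = FALSE
)

# Group entity codes by label, then keep labels carried by more than one code.
ids_by_label <- split(res$item_id, res$item_text)
ambiguous    <- ids_by_label[lengths(ids_by_label) > 1]

print(ambiguous)
```

In this sample only the label “Afrique” is shared: the continent, the Roman province, the tectonic plate and a disambiguation page all carry it in French.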

Boring details

Analysis of the list of results reveals four situations :

  1. Target entities : A first list contains entities that can be considered as world regions or geographical imaginations of interest for IMAGEUN. This is typically the case for the whole continent of Africa (Q15) and its different subdivisions like North Africa (Q27381), West Africa (Q4412), Sub-Saharan Africa (Q132959).

  2. Control entities : A list of entities that are not regions but must be controlled if we want to identify our target entities. A typical example is the sovereign state of South Africa (Q258), which will necessarily introduce errors in the identification of Africa as a continent if it is not controlled. The problem will not necessarily exist in all languages (e.g. German) but is important.

  3. Ambiguous entities : Some entities are ambiguous because they are not regions but use exactly the same textual units as a target entity. This is for example the case of the Roman province of Africa (Q181238), which cannot easily be differentiated from the continent, except by manual inspection. These units are not easy to control but are fortunately not frequent in general.

  4. Insignificant entities : Entities that are exceptional in the corpus can simply be ignored.
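
The role of control entities can be illustrated with a small base R sketch. It assumes a simple "longest label first" matching strategy, which is only one possible way to control for South Africa and is not necessarily the one implemented in the project. The function `tag_text()`, the `role` column and the two test sentences are invented for this example.

```r
# Minimal French dictionary: two target entities and one control entity.
dict <- data.frame(
  id    = c("Q15", "Q258", "Q27381"),
  label = c("Afrique", "Afrique du Sud", "Afrique du Nord"),
  role  = c("target", "control", "target"),
  stringsAsFactors = FALSE
)

tag_text <- function(text, dict) {
  # Try longer labels first so that "Afrique du Sud" is consumed
  # before the shorter label "Afrique" can match inside it.
  dict  <- dict[order(-nchar(dict$label)), ]
  found <- character(0)
  for (i in seq_len(nrow(dict))) {
    if (grepl(dict$label[i], text, fixed = TRUE)) {
      found <- c(found, dict$id[i])
      # Blank out the matched label so that shorter labels cannot reuse it.
      text <- gsub(dict$label[i], " ", text, fixed = TRUE)
    }
  }
  found
}

# Invented sentences, for illustration only.
t1 <- tag_text("L’Afrique du Sud ferme ses frontières", dict)
t2 <- tag_text("L’Europe et l’Afrique face à l’Afrique du Sud", dict)

# Control entities are recognized, then dropped from the final tags.
controls <- dict$id[dict$role == "control"]
setdiff(t1, controls)  # nothing left: the sentence is about the state only
setdiff(t2, controls)  # the continent is still detected
```

Without the control entity, the first sentence would be wrongly tagged as mentioning the continent (Q15).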

Workflow in a nutshell

We propose a semi-automatic method for extracting entities in different languages, which requires the presence of a human expert at each step of the analysis. The figure below describes an example of a search for world regions related to Africa in three languages.

The programs used for the implementation are explained in the media cookbook on GitHub, with an example of implementation available on the following page

3. Q & D EXPERIMENTS

We have tested the previous workflow on an arbitrary selection of entities, which are mainly related to continents and “natural” Earth divisions :

This quick and dirty analysis does not offer any guarantee of quality because :

  1. The list of entities has not been discussed by the IMAGEUN partners
  2. The dictionaries established in the different languages have not been checked by native speakers

The purpose is therefore only to provide food for thought.

Data

We start from a corpus of texts in which the target Wikidata entities have been recognized :

text | source | date | regs | nbregs
Asie, Afrique, Europe: la nouvelle stratégie de l’État islamique | fr_FRA_figaro | 2019-05-03 | Q48 Q15 Q46 | 3
‘Rolling emergency’ of locust swarms decimating Africa, Asia and Middle East | en_GBR_guardi | 2020-06-08 | Q15 Q48 Q7204 | 3
Coronavirus pushes beyond Asia as it takes aim at Europe and Middle East | en_NIR_beltel | 2020-02-24 | Q48 Q46 Q7204 | 3
Solar eclipse wows stargazers in Africa, Asia and the Middle East | en_NIR_beltel | 2020-06-21 | Q15 Q48 Q7204 | 3
Avrasya Tüneli Avrupa Anadolu geçişi bir saat trafiğe kapatıldı: Uzun araç kuyrukları oluştu | tr_TUR_yenisa | 2020-05-28 | Q5401 Q46 Q12824780 | 3
«L’émigration permanente vers l’Europe prive l’Afrique de ses jeunes les plus brillants» | fr_FRA_figaro | 2019-02-15 | Q46 Q15 | 2
L’Amérique et l’Europe frappent la Syrie au portefeuille | fr_FRA_figaro | 2019-02-26 | Q828 Q46 | 2
Sommet Trump-Kim à Hanoï: quels enjeux pour les alliés des États-Unis en Europe et en Asie? | fr_FRA_figaro | 2019-02-27 | Q46 Q48 | 2
Des migrants venus d’Asie traversent la Méditerranée | fr_FRA_figaro | 2019-05-03 | Q48 Q4918 | 2
Sept pays d’Amérique du Sud en sommet pour défendre l’Amazonie | fr_FRA_figaro | 2019-09-06 | Q18 Q2841453 | 2
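
To show how such a table can be handled, the following base R sketch retypes three rows of it (without the headline text), splits the `regs` field into individual entity codes and checks it against `nbregs`. Reading the `source` code as language_COUNTRY_media is our interpretation of the codes, and the object names are chosen for this example.

```r
# Three rows retyped from the table above (headline text omitted).
news <- data.frame(
  source = c("fr_FRA_figaro", "en_GBR_guardi", "fr_FRA_figaro"),
  date   = as.Date(c("2019-05-03", "2020-06-08", "2019-02-15")),
  regs   = c("Q48 Q15 Q46", "Q15 Q48 Q7204", "Q46 Q15"),
  nbregs = c(3L, 3L, 2L),
  stringsAsFactors = FALSE
)

# Split the space-separated codes and run two sanity checks:
# the count must match nbregs and every code must look like a Wikidata id.
reg_list <- strsplit(news$regs, " ", fixed = TRUE)
stopifnot(all(lengths(reg_list) == news$nbregs))
stopifnot(all(grepl("^Q[0-9]+$", unlist(reg_list))))

# Assumed structure of the source code: language, country, media.
parts <- do.call(rbind, strsplit(news$source, "_", fixed = TRUE))
news$lang    <- parts[, 1]
news$country <- parts[, 2]
news$media   <- parts[, 3]

# Long format: one row per (news item, region), convenient for counting.
long <- data.frame(
  source = rep(news$source, lengths(reg_list)),
  date   = rep(news$date, lengths(reg_list)),
  region = unlist(reg_list),
  stringsAsFactors = FALSE
)
counts <- table(long$region)
print(counts)
```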

Experiment 1 : An Inter-Language analysis of lexical universes

Experiment 1 : Europe (Q46) in 2019

Experiment 1 : Europe (Q46) in 2020

Experiment 1 : Mediterranean Sea (Q4918) in 2019-2021

Experiment 2 : A Cross-Language analysis of regional entities

Experiment 2 : Data aggregation

For experiment 2, we create a new object called hypercube, from which the text of the news has been removed and in which we keep only the number of tags or the proportion of news mentioning one or several regions (where1, where2), by media (who) and by time period (when)

who | when | where1 | where2 | tags | news
fr_FRA_figaro | 2019-01-01 | Q46 | Q15 | 2 | 0.3611111
fr_FRA_figaro | 2020-01-01 | Q46 | Q15 | 2 | 0.5000000
fr_FRA_figaro | 2021-01-01 | Q46 | Q15 | 1 | 0.2500000
de_DEU_frankf | 2021-01-01 | Q46 | Q15 | 1 | 0.2500000
de_DEU_suddeu | 2020-01-01 | Q46 | Q15 | 1 | 0.2500000
en_GBR_telegr | 2020-01-01 | Q46 | Q15 | 1 | 0.2500000
en_IRL_irtime | 2019-01-01 | Q46 | Q15 | 1 | 0.2500000
en_IRL_irtime | 2020-01-01 | Q46 | Q15 | 1 | 0.2500000
en_IRL_irtime | 2021-01-01 | Q46 | Q15 | 1 | 0.2500000
tr_TUR_cumhur | 2020-01-01 | Q46 | Q15 | 1 | 0.2500000
tr_TUR_yenisa | 2021-01-01 | Q46 | Q15 | 1 | 0.2500000
ar_TUN_babnet | 2021-01-01 | Q46 | Q15 | 1 | 0.2500000
fr_TUN_ecomag | 2019-01-01 | Q46 | Q15 | 1 | 0.2500000
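
The aggregation step can be sketched in base R as follows. This is a simplified reconstruction under our own assumptions: each pair of distinct regions mentioned in the same news item counts as one tag, and the period is the year of publication. The weighting behind the `news` column is not described in this note and is therefore not reproduced. The input rows are retyped from the corpus sample shown in the Data section; `pairs_for()` is a helper written for this example.

```r
# Tagged news without their text (three rows from the corpus sample).
news <- data.frame(
  who  = c("fr_FRA_figaro", "fr_FRA_figaro", "en_GBR_guardi"),
  date = as.Date(c("2019-05-03", "2019-02-15", "2020-06-08")),
  regs = c("Q48 Q15 Q46", "Q46 Q15", "Q15 Q48 Q7204"),
  stringsAsFactors = FALSE
)

# One row per ordered pair of distinct regions found in the same news item.
pairs_for <- function(i) {
  r <- strsplit(news$regs[i], " ", fixed = TRUE)[[1]]
  if (length(r) < 2) return(NULL)  # a single region gives no pair
  p <- expand.grid(where1 = r, where2 = r, stringsAsFactors = FALSE)
  p <- p[p$where1 != p$where2, ]
  rownames(p) <- NULL
  data.frame(who  = news$who[i],
             when = format(news$date[i], "%Y-01-01"),  # year as period
             p, stringsAsFactors = FALSE)
}
pairs <- do.call(rbind, lapply(seq_len(nrow(news)), pairs_for))
pairs$tags <- 1L

# Hypercube: number of tags by media, period and pair of regions.
hypercube <- aggregate(tags ~ who + when + where1 + where2,
                       data = pairs, FUN = sum)

# Slice of the cube for the pair Europe (Q46) and Africa (Q15).
sel <- hypercube[hypercube$where1 == "Q46" & hypercube$where2 == "Q15", ]
print(sel)
```

Storing ordered pairs makes the cube symmetric, so the same slice can be read from the point of view of either region.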

Experiment 2 : Top 20 regions in full corpus

We first present a table of the top entities in the whole corpus of newspapers.

rank | id | de | en | fr | tr | nb
1 | Q46 | Europa | Europe | Europe | Avrupa | 4546
2 | Q15 | Afrika | Africa | Afrique | Afrika | 1022
3 | Q4918 | Mittelmeer | Mediterranean Sea | mer Méditerranée | Akdeniz | 912
4 | Q7204 | Mittlerer Osten | Middle East | Moyen-Orient | Orta Doğu | 332
5 | Q48 | Asien | Asia | Asie | Asya | 293
6 | Q66065 | Sahelzone | Sahel | Sahel | Sahel | 240
7 | Q98 | Pazifischer Ozean | Pacific Ocean | océan Pacifique | Büyük Okyanus | 200
8 | Q25322 | Arktis | Arctic | Arctique | Arktika | 180
9 | Q97 | Atlantischer Ozean | Atlantic Ocean | océan Atlantique | Atlas Okyanusu | 180
10 | Q1286 | Alpen | Alps | Alpes | Alpler | 174
11 | Q28227 | Maghreb | Maghreb | Maghreb | Mağrip | 136
12 | Q6583 | Sahara | Sahara | Sahara | Sahra | 122
13 | Q12585 | Lateinamerika | Latin America | Amérique latine | Latin Amerika | 122
14 | Q664609 | Karibik | Caribbean | Caraïbes | Karayipler | 110
15 | Q51 | Antarktika | Antarctica | Antarctique | Antarktika | 105
16 | Q2841453 | Amazonien | Amazonia | Amazonie | NA | 104
17 | Q48214 | Naher Osten | Near East | Proche-Orient | Yakın Doğu | 84
18 | Q35942 | Polynesien | Polynesia | Polynésie | Polinezya | 82
19 | Q18 | Südamerika | South America | Amérique du Sud | Güney Amerika | 74
20 | Q23522 | Balkanhalbinsel | Balkans | Balkans | Balkanlar | 71

Experiment 2 : Turkish newspapers - Top 10 regions

rank | Cumhuriyet region | Cumhuriyet pct | Yeni Savas region | Yeni Savas pct
1 | Avrupa | 64.5 | Avrupa | 53.4
2 | Akdeniz | 16.6 | Akdeniz | 24.1
3 | Afrika | 4.2 | Afrika | 7.1
4 | Asya | 2.9 | Asya | 3.0
5 | Avrasya | 2.4 | Avrasya | 2.4
6 | Antarktika | 1.5 | Sahra | 1.5
7 | Sahra | 1.2 | Orta Doğu | 1.5
8 | Orta Doğu | 1.1 | Antarktika | 1.3
9 | Orta Asya | 0.7 | Kafkasya | 1.1
10 | Latin Amerika | 0.6 | Basra Körfezi | 0.9

Experiment 2 : German newspapers - Top 10 regions

rank | FAZ region | FAZ pct | Süd. Zeit. region | Süd. Zeit. pct
1 | Europa | 57.4 | Europa | 49.4
2 | Afrika | 6.8 | Afrika | 8.3
3 | Mittelmeer | 5.0 | Mittlerer Osten | 8.1
4 | Asien | 4.2 | Mittelmeer | 7.4
5 | Alpen | 3.4 | Alpen | 4.8
6 | Mittlerer Osten | 2.4 | Naher Osten | 2.5
7 | Osteuropa | 2.0 | Südamerika | 2.4
8 | Balkanhalbinsel | 1.9 | Lateinamerika | 1.7
9 | Südamerika | 1.4 | Arktis | 1.7
10 | Südostasien | 1.4 | Asien | 1.5

Experiment 2 : French newspapers - Top 10 regions

rank | Figaro region | Figaro pct | Le Monde region | Le Monde pct
1 | Europe | 39.5 | Europe | 27.2
2 | mer Méditerranée | 8.2 | Afrique | 18.5
3 | Afrique | 6.0 | Sahel | 9.9
4 | Sahel | 4.9 | mer Méditerranée | 8.6
5 | Amazonie | 4.6 | Proche-Orient | 4.2
6 | Alpes | 3.7 | Moyen-Orient | 3.8
7 | Polynésie | 3.4 | Amazonie | 2.5
8 | Moyen-Orient | 2.9 | Alpes | 2.5
9 | océan Pacifique | 2.5 | Polynésie | 2.5
10 | Asie | 2.2 | Sahara | 2.1

Experiment 2 : UK newspapers - Top 10 regions

rank | Guardian region | Guardian pct | Daily Telegraph region | Daily Telegraph pct
1 | Europe | 36.6 | Europe | 50.8
2 | Africa | 9.6 | Africa | 12.6
3 | Arctic | 7.2 | Asia | 4.8
4 | Pacific Ocean | 7.0 | Caribbean | 4.0
5 | Middle East | 6.7 | Pacific Ocean | 3.3
6 | Atlantic Ocean | 4.4 | Middle East | 2.8
7 | Asia | 3.0 | Southeast Asia | 2.6
8 | Latin America | 2.9 | Arctic | 2.6
9 | Antarctica | 2.6 | South China Sea | 1.8
10 | Caribbean | 2.4 | Atlantic Ocean | 1.8

Experiment 2 : Irish newspapers - Top 10 regions

rank | Irish Times region | Irish Times pct | Belfast Telegraph region | Belfast Telegraph pct
1 | Europe | 56.5 | Europe | 51.3
2 | Atlantic Ocean | 4.5 | Africa | 7.4
3 | Africa | 4.4 | Atlantic Ocean | 7.3
4 | Asia | 3.9 | Arctic | 5.7
5 | Pacific Ocean | 3.8 | Middle East | 4.6
6 | Middle East | 3.7 | Asia | 4.5
7 | Caribbean | 2.5 | Caribbean | 3.4
8 | Maghreb | 1.9 | Pacific Ocean | 3.1
9 | Alps | 1.8 | South America | 1.3
10 | Latin America | 1.6 | Central America | 1.3

Experiment 2 : Tunisian newspapers - Top 5 regions

Due to the limited number of news items, only the top 5 regions are presented. The newspaper Babnet is in Arabic.

rank | Babnet (ar) region | Babnet (ar) pct | Econ. Mag region | Econ. Mag pct | La Presse region | La Presse pct | Réalités region | Réalités pct
1.0 | Afrique | 46.3 | Afrique | 34.3 | mer Méditerranée | 30.8 | Afrique | 40.0
2.0 | Europe | 22.7 | Maghreb | 18.6 | Afrique | 30.2 | Europe | 14.2
3.5 | Maghreb | 6.8 | mer Méditerranée | 14.7 | Sahel | 15.1 | Maghreb | 11.0
3.5 | Sahel | 6.8 | Europe | 11.9 | Maghreb | 6.4 | Sahel | 9.7
5.0 | Moyen-Orient | 5.5 | Afrique du Nord | 5.8 | Europe | 4.7 | mer Méditerranée | 9.0
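
The value 3.5 that appears twice in the first column of the table above is consistent with an average rank given to tied values: in Babnet, Maghreb and Sahel both reach 6.8 and share ranks 3 and 4. The base R sketch below reproduces this behaviour on the Babnet percentages. That the original table was computed this way is our assumption.

```r
# Percentages of the five top regions in Babnet, copied from the table.
pct <- c(Afrique = 46.3, Europe = 22.7, Maghreb = 6.8,
         Sahel = 6.8, `Moyen-Orient` = 5.5)

# Rank in decreasing order; tied values receive the mean of their ranks.
rk <- rank(-pct, ties.method = "average")
print(rk)

# Alternative convention: tied values both get the lowest rank (3 and 3).
rk_min <- rank(-pct, ties.method = "min")
print(rk_min)
```

With `ties.method = "min"` the two regions would both be ranked 3, which may be easier to read in a top 5 table.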

Experiment 2 : Correspondence analysis - Factors 1-2

Experiment 2 : Factors 3-4

Experiment 2 : Cluster analysis (world regions)

Experiment 2 : Cluster analysis (media)
